EXECUTIVE SUMMARY


In this paper, it is contended that the threshold challenges that must be adequately 
addressed before Big Data sources can be used for the production of official statistics 
are the business case, the validity of statistical inference and data ownership and 
access issues. 


The business case comprises business needs and benefits, and data ownership and 
access issues are particularly important where, as is commonly the case, the National 
Statistical Office is not the custodian of the Big Data source. Above all, given the 
expected inferential biases from Big Data — due to under-coverage, self-selection, 
missing values etc. — statistical methods must be developed before Big Data sources 
can be harnessed for the production of official statistics. 


Using a Bayesian framework, this paper outlines necessary conditions — in particular, 
the Missing At Random condition — for valid statistical inference to be made for 
estimating or predicting finite population parameters (e.g. totals of population units), 
or for estimating the super-population parameters of statistical models (e.g. the 
regression coefficients of a linear regression model). 


By assuming that Missing At Random conditions are fulfilled, the paper also provides 
an illustrative theoretical method for utilising satellite imagery data to predict crop 
areas and crop yields. The analysis assumes that the data are described by a dynamic 
logistic model for crop types and a dynamic linear model for crop yields. The method 
relies on using “ground truth” data from a random sample to calibrate the satellite 
imagery, and using the latter as covariates to predict the data of interest for the 
population not included in the random sample. 


Finally, the paper outlines methods to address related statistical computing issues and 
proposes strategies for extending the model to provide a better fit to the observed 
data. 
A STATISTICAL FRAMEWORK FOR ANALYSING BIG DATA


Dr Siu-Ming Tam 
Chief Methodologist 
Australian Bureau of Statistics¹


ABSTRACT 


In this paper, it is contended that the threshold challenges that must be adequately 
addressed before Big Data sources can be used for the production of official statistics 
are the business case, the validity of statistical inference and data ownership and
access issues.


Using statistical modelling, the paper outlines necessary conditions for addressing the 
biases inherent in Big Data sources when estimating parameters of a finite population 
or super-population model. 


To illustrate the proposed statistical framework, the paper describes a method, based 
on State Space modelling, for utilising satellite imagery data to predict crop types and 
crop yields. The paper also outlines methods to address related statistical computing 
issues, and proposes strategies for extending the model to provide a better fit to the 
observed data. 


1 This paper was written in response to an invitation from the Editor of the Survey Statistician to provide some 
perspectives on the American Association for Public Opinion Research (AAPOR) Task Force Report on Big 
Data (Japec et al., 2015). 
1. INTRODUCTION


In a 2014 talk to the Victorian Branch of the Australian Statistical Society, Professor 
Terry Speed, an eminent mathematical statistician and winner of the 2014 Australian 
Prime Minister’s Science Award, expressed surprise about the lack of visibility of 
statisticians in the Big Data debate, and said “...the absence of statisticians in Big Data 
activities is striking (to a statistician)”. He also observed that statisticians were generally
absent from national and international conferences on Big Data.


In an article entitled “Big Data or Big Fail? The Good, the Bad and the Ugly and the 
Missing Role of Statistics”, Iacus (2014) echoed Terry Speed’s point about the role 
statistics and statisticians can play in the field of Big Data. 


Against this background, I warmly welcome the well-written and well-researched Report by
the American Association for Public Opinion Research (AAPOR) Task Force (Japec et
al., 2015). The references provided in the Report will be very useful to statisticians
who want to use Big Data or contribute to the Big Data debate.


I particularly like the report’s comprehensiveness in raising the many different issues 
of Big Data, covering not only what it is and why it matters, but also the policy, 
technical and technology challenges facing users of Big Data in solving business 
problems or finding answers to societal questions. 


As a practising official statistician, I find Section 7 of the AAPOR Report very
interesting, in particular Sub-section 7.3 on combining Big Data and Survey
Data. I will therefore devote most of my comments to this issue. I will also
outline the preliminary work undertaken in the Australian Bureau of Statistics (ABS) to
investigate the business case for, and the validity of, harnessing certain Big Data sources
for the regular production of official statistics.
2. THRESHOLD CHALLENGES FOR BIG DATA


Whilst the Report has outlined a number of key challenges for Big Data use and
analysis, I would contend that the Business Case, the Validity of Statistical Inference
(i.e. using Big Data in statistically valid ways; page 22 of the Task Force Report) and
Data Ownership (page 30) are the threshold challenges confronting official statisticians
in the use of Big Data in the regular production of official statistics.


In saying this, I am not downplaying the other challenges such as Data Stewardship,
Data Collection Authority, Privacy and Re-identification. National Statistical Offices
(NSOs) are generally well set up and have developed capability to address these
challenges. For example, many statistical offices have already developed methods,
processes and procedures to address privacy and confidentiality issues in their
statistical releases (see, for example, the Special Issue of the Statistical Journal of the
International Association of Official Statistics on “Official Statistics and Micro Data:
Access and Confidentiality” released in 2009), which may be adapted to address
releases based on, or supplemented by, Big Data. A detailed discussion of the Big
Data challenges faced by NSOs is provided in Tam and Clarke (2015a). My
contention is that if the threshold challenges cannot be overcome, i.e. if there is no
business case for using a particular Big Data source, if the Big Data source cannot
provide valid statistical inferences, and if the Big Data source is not available to official
statisticians, there is no question of using the Big Data source in regular statistical
production, and the other challenges do not arise.
3. BUSINESS CASE


What is the Business Case for Big Data? A business case comprises the business need
(what business problems do we want to solve, and can Big Data be part of the solution?)
and the business benefit (does the benefit of Big Data as a solution outweigh the
costs?).


Being a collective term for a diverse range of data sources (page 5), the business case 
for Big Data does vary from source to source. For example, there is clearly a business 
case in the use of Administrative Data (page 9) by official statisticians in the 
production of official statistics, e.g. in the use of birth, death and migration records to 
complement the data from population censuses to provide contemporary population 
estimates. Cargo manifests are used to produce trade statistics. Without these 
sources, it will not be possible to provide population estimates or trade statistics. In 
other words, these sources provide valuable information to fill a data gap. 


However, I have heard of propositions such as “... let’s bring all the Big Data into our 
organisation and then figure out what we want to do with it. And to effectively do 
this, let’s upgrade our computer hardware, or software, because Big Data requires big 
data processing capabilities ...”. These propositions worry me as they put the cart 
(Big Data) before the horse (business problems) and treat “Big Data as a solution in 
search of a problem”. 


In my view, Big Data should only be used if it can: 


• improve the product offerings of statistical offices, e.g. more frequent release of
official statistics, more detailed statistics, more statistics for small population
groups or areas, or filling an important data gap (business need); or

• improve the cost efficiency in the production of official statistics (business
benefit).


The AAPOR Report rightly points out (page 15) that the “costs and risks of realising
these (i.e. Big Data) benefits are non-trivial”. For example, in the case of satellite data,
whilst the risk of not having access to the data is small given that most of these are
available free of charge on the internet, the costs are substantial: the cost of creating
the ground truth data and marrying them up with satellite data at the level of the
observation unit (e.g. a statistical local area), and the costs of storing, cleaning,
processing and quality assuring the data and of software development. In the case of
the Australian Bureau of Statistics (ABS), while the business need for using satellite
data, instead of direct data collection, to estimate crop areas and crop yields has been
well established, the business benefit has yet to be assessed.
4. A POSSIBLE APPROACH TO USING BIG DATA FOR OFFICIAL STATISTICS


An approach which has recently been actively pursued at the ABS (Tam and Clarke,
2015b) for the use of satellite data in official statistics production is to consider the
$N \times 1$ vector of measurements, $Y_t$, of interest to the official statistician, e.g. crop areas
or yields, at time $t$ as a realisation of a super-population model, with the Big Data
augmented with non-Big Data sources, $Z_t$, treated as a (design) matrix of covariates
for the model, i.e.

$$Y_t = Z_t \beta_t + e_t \qquad (1)$$

and allowing the vector of regression parameters, $\beta_t$, to change over time, i.e.

$$\beta_t = H_t \beta_{t-1} + \varepsilon_t \qquad (2)$$


Here $N$ denotes the size of the finite population, e.g. the total number of land parcels.
Equations (1) and (2) form the well-known State Space Model. Under this
formulation, we consider that a sample, $s_t$, of units is chosen, e.g. a sample of
observation units at time $t$, on which observations of the value of $Y_{ot}$, where ‘o’
denotes observed (or responding) units, are obtained. Denote by ‘m’ the units in $s_t$
on which there is no observation, i.e. missing data, and by ‘r’ the units not
selected in the sample (the complement of $s_t$); then the vector $Y_t$ can be partitioned as

$$Y_t = (Y_{ot}', Y_{mt}', Y_{rt}')'.$$

State Space Models were used in Tam (1987) for predicting finite population
parameters in finite population sampling.
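Equations (1) and (2) can be made concrete with a small simulation. The sketch below is illustrative only: the dimensions, the noise standard deviations, the identity transition matrix and the function name `simulate_state_space` are assumptions introduced here, not taken from Tam and Clarke (2015b), and independent Gaussian errors are assumed for simplicity.

```python
import numpy as np

def simulate_state_space(Z, H, beta0, sigma_e, sigma_eps, rng):
    """Simulate Y_t = Z_t beta_t + e_t with beta_t = H beta_{t-1} + eps_t.

    Z has shape (T, N, p): one N x p covariate matrix per time point.
    Returns Y with shape (T, N) and the state path beta with shape (T, p).
    """
    T, N, p = Z.shape
    Y = np.empty((T, N))
    beta = np.empty((T, p))
    prev = np.asarray(beta0, dtype=float)
    for t in range(T):
        beta[t] = H @ prev + rng.normal(0.0, sigma_eps, size=p)    # equation (2)
        Y[t] = Z[t] @ beta[t] + rng.normal(0.0, sigma_e, size=N)   # equation (1)
        prev = beta[t]
    return Y, beta

rng = np.random.default_rng(0)
T, N, p = 5, 200, 2                      # hypothetical sizes
Z = rng.normal(size=(T, N, p))           # stand-in for satellite covariates
Y, beta = simulate_state_space(Z, np.eye(p), [1.0, -0.5], 0.1, 0.05, rng)
```

With `H` equal to the identity, the regression coefficients follow a random walk, which is one simple way of letting the relationship between the covariates and the response drift over time.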


Assuming that we can match these observed units to the corresponding units in the
Big Data source and non-Big Data sources available to the statistician, e.g. by geographic
location (in a survey, the linkage is automatic through the questionnaire as a collection
instrument), and as can be seen from diagram 5.1 below, for every unit in the sample,
$s_t$, one of the following two conditions will apply: either there is a corresponding
set of data from Big Data for the unit, or there is not. Denote by ‘$B$’ those units that
have Big Data information, and by ‘$\bar{B}$’ those that do not. Then (1) can be re-written as:

$$
\begin{pmatrix} Y_{oBt} \\ Y_{mBt} \\ Y_{rBt} \\ Y_{o\bar{B}t} \\ Y_{m\bar{B}t} \\ Y_{r\bar{B}t} \end{pmatrix}
=
\begin{pmatrix} Z_{oBt} \\ Z_{mBt} \\ Z_{rBt} \\ Z_{o\bar{B}t} \\ Z_{m\bar{B}t} \\ Z_{r\bar{B}t} \end{pmatrix} \beta_t
+
\begin{pmatrix} e_{oBt} \\ e_{mBt} \\ e_{rBt} \\ e_{o\bar{B}t} \\ e_{m\bar{B}t} \\ e_{r\bar{B}t} \end{pmatrix}
\qquad (3)
$$
Note that (3) can be extended to Generalised Linear Models and Generalised Linear
Mixed Models (see the penultimate section of this paper).

Let $I_t$, $R_t$ and $\mathcal{R}_t$ denote random variables representing the sampling, response and Big
Data under-coverage processes respectively. These are column vectors whose i-th
element is given by $\delta_i^{(I_t)}$, $\delta_i^{(R_t)}$ and $\delta_i^{(\mathcal{R}_t)}$ respectively, which is ‘one’ if the i-th unit is
in the sample, responded or is covered in the Big Data respectively, and ‘zero’
otherwise.
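The three indicator vectors determine which of the six blocks of equation (3) each unit falls into. The following sketch shows that bookkeeping; the population size, the probabilities used to generate the indicators and the names `partition`, `oBbar` and so on are hypothetical and serve only to illustrate the partition.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000                                  # hypothetical population size

in_sample = rng.random(N) < 0.10          # sampling indicator: 1 if the unit is sampled
responded = rng.random(N) < 0.80          # response indicator: 1 if the unit would respond
in_big_data = rng.random(N) < 0.70        # coverage indicator: 1 if the unit is in the Big Data

def partition(in_sample, responded, in_big_data):
    """Return boolean masks for the six groups of units in equation (3)."""
    o = in_sample & responded             # observed
    m = in_sample & ~responded            # sampled but missing
    r = ~in_sample                        # not sampled
    B, not_B = in_big_data, ~in_big_data
    return {"oB": o & B, "mB": m & B, "rB": r & B,
            "oBbar": o & not_B, "mBbar": m & not_B, "rBbar": r & not_B}

groups = partition(in_sample, responded, in_big_data)
counts = {name: int(mask.sum()) for name, mask in groups.items()}
```

Every unit belongs to exactly one group, so the counts add up to the population size.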


The inference problem under the model (2) and (3) can then be stated as follows:

• The data for inference for the finite population, say the population total, $1'Y_t$, at
time $t$ are

$$D^{(t)} = \{Y_{oB\tau}, Z_{oB\tau}, Z_{mB\tau}, Z_{rB\tau} : \tau = 1, \ldots, t\}$$

and $P^{(t)} = P_1^{(t)} \cup P_2^{(t)}$,
where $P_1^{(t)} = \{I_1, R_1, \ldots, I_t, R_t\}$
and $P_2^{(t)} = \{\mathcal{R}_1, \ldots, \mathcal{R}_t\}$.

• Model-assisted methods (Särndal et al., 1992) and model-based methods
(Chambers and Clark, 2012), including Bayesian methods (Puza, 2013), may be
applied for making inference.

• Whatever method of inference is used, the official statistician needs to
understand, or make assumptions about, the processes leading to the missing
and non-sample data, i.e. how the unobserved components in equation (3) (all
components other than $Y_{oBt}$, $Z_{oBt}$, $Z_{mBt}$ and $Z_{rBt}$) come into being. Where missing
at random conditions are not met (see Section 5 below), modelling of the missing and
non-sample selection processes has to be undertaken. For Big Data sources, this can be
very challenging, if not insurmountable.
5. VALIDITY OF STATISTICAL INFERENCES


I welcome the attempt by the Task Force to provide a total error framework for Big 
Data (page 18), and Couper (2013) provides a good description of the types of errors 
encountered in Big Data. 


I cannot agree more strongly with the Report that “... using Big Data in statistically 
valid ways is challenging and one misconception is the belief that the volume of the 
data can compensate for any other deficiency in the data (Big Data Hubris)” (page 22). 
Unlike sampling errors, non-sampling errors will not be reduced by increasing the 
sample size. Likewise, correlation is not the same as causality. In a recent article in 
Significance, entitled “Big Data, Big Mistake?”, Harford (2014) showed how such a 
misunderstanding can have fatal consequences. The Report’s reference to Fan et al.
(2014) is particularly valuable to those Big Data enthusiasts who believe that size is
everything!
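The point that volume does not compensate for selection bias can be illustrated with a short simulation. This is a sketch under assumed conditions: the population values are standard normal with true mean zero, and each unit opts in to the data source with a probability that increases with its own value (a logistic self-selection rule chosen purely for illustration).

```python
import numpy as np

def self_selected_mean(n, rng):
    """Mean of y among units that opt in with probability increasing in y."""
    y = rng.normal(0.0, 1.0, size=n)          # population values, true mean 0
    p_in = 1.0 / (1.0 + np.exp(-y))           # inclusion depends on y itself
    included = rng.random(n) < p_in
    return y[included].mean()

rng = np.random.default_rng(42)
biases = {n: self_selected_mean(n, rng) for n in (10_000, 100_000, 1_000_000)}
for n, bias in biases.items():
    print(f"n = {n:>9,d}   bias of the self-selected mean = {bias:.3f}")
```

The sampling variability shrinks as the data set grows, but the bias of the self-selected mean stays at roughly the same positive value for every size.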


To explore the conditions for validity of statistical inference, we depict the
relationship between a particular Big Data source (e.g. satellite imagery data) and the
target population of interest to the official statistician (e.g. the land parcels) in diagram
5.1 below. We also make the simplifying assumption (not always true, e.g. for social media
data) that a unit of interest in the target population will appear in the Big
Data source, if at all, only once. This is to ensure the possibility of making a unique
linkage between the Y value of a unit in the target population and the corresponding
Z values from Big Data (and non-Big Data sources). (Note that if there are multiple
appearances, an approach that may be adopted would be to randomly choose one
appearance where the appearances are homogeneous, include an additional covariate
where there is structured heterogeneity, or use a repeated measures model (Denham
et al., 2011) where the appearances are sufficiently heterogeneous.)


The joint areas of the two big circles in diagram 5.1 are divided into three segments:
under-coverage, i.e. information of interest to the official statistician but not available
from Big Data; over-coverage, i.e. information available from Big Data that is of no
interest; and finally, information of interest and available. Also, the ‘system’ can be
described as comprising a data process, state process and censoring process, with
prior distributions $f(\theta)$, $f(\phi)$ and $f(\psi)$ with known hyper-parameters.


Under the approach advocated in this paper, I assume that a probability sample is drawn
from the population of interest (so as to fulfil the non-informative sampling conditions
for descriptive and analytic inferences; see (6) and (10) below), from which
observations are made. These, combined with the corresponding Big Data for the
same observation units, are used to provide the posterior distribution of the model
parameters (the Estimation step). The resultant posterior distribution, together with
the Big Data for the non-sampled units, is then used to predict the values of these
units using the predictive distribution (the Prediction step).
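Diagram 5.1: Integrating designed data with found data

[Diagram not reproduced. It shows two overlapping circles, the Big Data source and the
inference population, alongside the State Space Process at time $t$. The part of the Big
Data source outside the inference population is over-coverage and is not relevant for
inference. Within the inference population, units denoted by ‘$B$’ are covered by Big Data,
with $Y_{oBt}, Y_{mBt}, Z_{oBt}, Z_{mBt}$ for sampled units and $Y_{rBt}, Z_{rBt}$ for non-sampled units.
Values for units denoted by ‘$\bar{B}$’ are not available due to under-coverage, which is
conceptualised as missing values due to a non-response process, $\mathcal{R}_t$, applied to Big Data.
Values of $Y$ in units denoted by ‘r’ are not available due to the sampling process, $I_t$, and
values of $Y$ in units denoted by ‘m’ are not available due to a missing process, $R_t$, applied
to the sampled data. The State Space Process comprises the Data Process, $f(Y_t; \theta)$; the
State Process, $f(\beta_t; \phi)$, assumed to be Markovian; the Censoring Processes, $f(I_t; \psi)$,
$f(R_t; \psi)$ and $f(\mathcal{R}_t; \psi)$; the Parameter Models, $f(\theta)$, $f(\phi)$ and $f(\psi)$; and the Data,
$D^{(t)}$ and $P^{(t)}$. In the original, blue denotes observed or available values.]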


5.1 Descriptive inferences 
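Under a Bayesian framework, the predictive inference of $Y_t$, $f(Y_t \mid D^{(t)}, P^{(t)})$, given
the data $D^{(t)}$ and $P^{(t)}$ (which I shall denote by $[Y_t \mid D^{(t)}, P^{(t)}]$ to simplify notation),
is given by

$$[Y_t \mid D^{(t)}, P^{(t)}] = \frac{[P_1^{(t)} \mid Y_t, D^{(t)}, P_2^{(t)}]\,[Y_t \mid D^{(t)}, P_2^{(t)}]}{[P_1^{(t)} \mid D^{(t)}, P_2^{(t)}]} = [Y_t \mid D^{(t)}, P_2^{(t)}],$$

provided that

$$[P_1^{(t)} \mid Y_t, D^{(t)}, P_2^{(t)}] = [P_1^{(t)} \mid D^{(t)}, P_2^{(t)}]. \qquad (4)$$

Assuming further that the finite population sampling and non-response processes at
times $\tau_1$ and $\tau_2$ are independent for $\tau_1 \neq \tau_2$ and $\tau_1, \tau_2 = 1, \ldots, t$, sufficient conditions
for (4) to hold are

$$[R_\tau \mid I_\tau, Y_\tau, D_\tau, \mathcal{R}_\tau] = [R_\tau \mid I_\tau, D_\tau, \mathcal{R}_\tau] \qquad (5)$$

and

$$[I_\tau \mid Y_\tau, D_\tau, \mathcal{R}_\tau] = [I_\tau \mid D_\tau, \mathcal{R}_\tau] \qquad (6)$$

for $\tau = 1, \ldots, t$.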
Equation (6) holds for probability sampling, and Equation (5) holds if the non-response
mechanism is missing at random (MAR) (Rubin, 1976). See, for example,
Little and Rubin (2002) for response process modelling in which MAR does not hold.
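The practical meaning of condition (5) can be seen in a small simulation. In the sketch below, which is an illustration under assumed conditions rather than part of the paper's method, a response variable depends linearly on one fully observed covariate. When response depends only on that covariate (MAR), a regression fitted to respondents predicts the population mean well; when response depends on the unobserved part of the response variable, the same procedure is biased. The model coefficients, the logistic response rules and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def predicted_mean(y, z, responded):
    """Fit y ~ z on respondents by least squares, then predict for all units."""
    X = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(X[responded], y[responded], rcond=None)
    return (X @ coef).mean()

rng = np.random.default_rng(7)
N = 200_000
z = rng.normal(size=N)                       # covariate known for every unit
e = rng.normal(size=N)                       # unobserved part of y
y = 2.0 + 3.0 * z + e

resp_mar = rng.random(N) < sigmoid(z)        # response depends on z only (MAR)
resp_nmar = rng.random(N) < sigmoid(e)       # response depends on unobserved e

mean_mar = predicted_mean(y, z, resp_mar)    # close to the population mean
mean_nmar = predicted_mean(y, z, resp_nmar)  # biased upwards
naive_mar = y[resp_mar].mean()               # respondent mean, ignoring z
```

Under MAR the model-based prediction removes the bias that the naive respondent mean suffers from, which is the sense in which the response process can be ignored once the covariates are conditioned on.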


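Now,

$$
\begin{aligned}
[Y_t \mid D^{(t)}, P_2^{(t)}] &= \int [Y_t, D_c^{(t)} \mid D^{(t)}, P_2^{(t)}]\, dD_c^{(t)} \\
&\propto \int [P_2^{(t)} \mid Y_t, D^{(t)}, D_c^{(t)}]\,[Y_t, D^{(t)}, D_c^{(t)}]\, dD_c^{(t)} \\
&\propto \int [Y_t, D^{(t)}, D_c^{(t)}]\, dD_c^{(t)} \\
&\propto [Y_t \mid D^{(t)}],
\end{aligned}
$$

provided that

$$[P_2^{(t)} \mid Y_t, D^{(t)}, D_c^{(t)}] = [P_2^{(t)} \mid D^{(t)}], \qquad (7)$$

where

$$D_c^{(t)} = \{Y_{mB\tau}, Y_{rB\tau}, Y_{o\bar{B}\tau}, Y_{m\bar{B}\tau}, Y_{r\bar{B}\tau}, Z_{o\bar{B}\tau}, Z_{m\bar{B}\tau}, Z_{r\bar{B}\tau} : \tau = 1, \ldots, t\}$$

represents the set of unobserved response variables and covariates in (3) for time 1 to
time $t$.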


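Assuming that the under-coverage ‘processes’ for Big Data at times $\tau_1$ and $\tau_2$ are
independent for $\tau_1 \neq \tau_2$ and $\tau_1, \tau_2 = 1, \ldots, t$, sufficient conditions for (7) to hold are:

$$[\mathcal{R}_\tau \mid Y_\tau, D_\tau, D_{\tau c}] = [\mathcal{R}_\tau \mid D_\tau], \qquad (8)$$

where $D_\tau = \{Y_{oB\tau}, Z_{oB\tau}, Z_{mB\tau}, Z_{rB\tau}\}$

and $D_{\tau c} = \{Y_{mB\tau}, Y_{rB\tau}, Y_{o\bar{B}\tau}, Y_{m\bar{B}\tau}, Y_{r\bar{B}\tau}, Z_{o\bar{B}\tau}, Z_{m\bar{B}\tau}, Z_{r\bar{B}\tau}\}$.

Note that $[\mathcal{R}_\tau \mid Y_\tau, D_\tau, D_{\tau c}] = [\mathcal{R}_\tau \mid D_\tau]$, for $\tau = 1, \ldots, t$, may be satisfied for
certain Big Data sources, e.g. administrative data, but not others, e.g. data from social
media where participation is self-selected.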


Where (4) and (7) are satisfied, $[Y_t \mid D^{(t)}, P^{(t)}] \propto [Y_t \mid D^{(t)}]$. In other words, the
sampling, missing data and under-coverage processes can be ignored when making
inference about $Y_t$.
Where (7) is not fulfilled, predictive inferences for Big Data will have to be based on
$[Y_t \mid D^{(t)}, P_2^{(t)}]$, which in turn requires modelling of $P_2^{(t)}$.


Prediction with missing covariates can be a very challenging problem. See, for 
example, Chapter 4 of Wu (2010) for possible methods and references to tackle this 
issue. 


5.2 Analytic inferences 


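The posterior distribution of the parameters, $\theta$ and $\phi$, is given by

$$
\begin{aligned}
[\theta, \phi \mid D^{(t)}, P^{(t)}] &= \int [\theta, \phi, D_c^{(t)} \mid D^{(t)}, P^{(t)}]\, dD_c^{(t)} \\
&\propto \int [P^{(t)} \mid D^{(t)}, D_c^{(t)}, \theta, \phi]\,[D^{(t)}, D_c^{(t)}, \theta, \phi]\, dD_c^{(t)} \\
&\propto [\theta, \phi \mid D^{(t)}],
\end{aligned}
$$

provided that

$$[P^{(t)} \mid D^{(t)}, D_c^{(t)}, \theta, \phi] = [P^{(t)} \mid D^{(t)}, \theta, \phi]. \qquad (9)$$

Sufficient conditions for (9) to hold are

$$[P_1^{(t)} \mid D^{(t)}, D_c^{(t)}, P_2^{(t)}, \theta, \phi] = [P_1^{(t)} \mid D^{(t)}, P_2^{(t)}, \theta, \phi] \qquad (10)$$

and

$$[P_2^{(t)} \mid D^{(t)}, D_c^{(t)}, \theta, \phi] = [P_2^{(t)} \mid D^{(t)}, \theta, \phi]. \qquad (11)$$

Where (10) and (11) are satisfied,

$$[\theta, \phi \mid D^{(t)}, P^{(t)}] = [\theta, \phi \mid D^{(t)}] \propto [D^{(t)} \mid \theta, \phi]\,[\theta \mid \phi]\,[\phi].$$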


Whilst the above is formulated under a Bayesian framework, I note that the data, $P_1^{(t)}$
and $P_2^{(t)}$, are ancillary for $Y_t$ or $(\theta, \phi)'$ under the assumptions laid out above.
Under the conditionality principle, frequentist inference for $Y_t$ or $(\theta, \phi)'$ should be
based on holding the data, $P_1^{(t)}$ and $P_2^{(t)}$, fixed (see, for example, Cox and Hinkley,
1974, page 31).
6. DATA OWNERSHIP


I also agree that data ownership and access is a key issue for NSOs and one where
there is a general lack of legislation and of a supporting framework (page 30). The
challenge is to unlock public good from privately collected data whilst protecting the
commercial interests of the data custodians.


In many cases, commercial value is placed on primary and derived non-government 
data sets by their owners, since either the provision of such data is the basis of their 
business, or its possession is a significant element of competitive advantage. This 
raises the issue of how the NSO might acquire commercially valuable or sensitive data 
for statistical production, particularly if the statistics compete directly with information 
products created by the data owner or they compromise its market position. This 
issue is made more complex by the fact that there may be several parties with some 
form of commercial right in relation to a data set, either through ownership, 
possession or licensing arrangements. 


Much Web content is also unstructured and ungoverned — the metadata describing its 
usage and provenance (origin, derivation, history, custody, and context) are either 
incomplete or incongruous. Indeed, the long-term reliability of Big Data sources may 
be an issue for ongoing statistical production. Reputable statistics for policy making 
and service evaluation are generally required for extended periods of time, often many 
years. However, large data sets from dynamic networks are volatile (as, arguably, are
static sources): the data sources may change in character or disappear over
time. This transience of data streams and sources does not sit comfortably with the
reliability of statistical production and the publication of meaningful time series.


With more statistics potentially available from the Web, subject to different levels of
biases and measurement errors at different points in time, what guidance can
statisticians provide to report, connect and compare these statistics over time and
between different sources? As a minimum, the statistical profession should encourage
the dissemination of these statistics to be accompanied by relevant metadata, for
example, in the form of quality declarations and in accordance with the Quality
Frameworks (ABS, 2010; Brackstone, 1999; OECD, 2011) widely adopted by official
statisticians.
7. A POSSIBLE ANALYSIS OF SATELLITE DATA TO PREDICT CROP YIELDS


To illustrate the potential analysis being developed in the ABS, I shall assume that 
equations (5), (6) and (7) are fulfilled by satellite data. Equation (6) is satisfied by 
choosing a random sample of observation units and collecting (ground truth) data on 
crop yields — the data are then integrated with satellite data to provide the ‘training 
dataset’. Equation (7) is fulfilled as the coverage of satellite data is the same as the 
coverage for land parcels. Equation (5) may not hold for certain areas in Australia due 
to persistent cloud cover, as a result of moisture in the atmosphere, which may affect 
the type of crops being grown, or yields. This issue may, however, be by-passed by 
using traditional data collections e.g. statistical surveys, instead of using satellite data, 
for these areas. 


Let the $N \times 1$ vectors $M_t$, $m_t$ and $Q_t$ be the column vectors of the crop yield, crop
type and quantity harvestable respectively, for every observation unit of Australia.
Then, $M_t = Q_t * m_t = \mathrm{Exp}(Y_t) * m_t$, where $*$ denotes the Hadamard product, the
$N \times 1$ vector $\mathrm{Exp}(Y_t)$ has $\exp(Y_{it})$ as its i-th element, $Y_{it} = \log Q_{it}$, and $Q_{it}$ is the
i-th element of $Q_t$. Under the MAR assumptions made above, we can ignore $P^{(t)}$ for
predictive inference. That is,

$$[Y_t, m_t \mid D^{(t)}, P^{(t)}] = [Y_t, m_t \mid D^{(t)}] = [Y_t \mid m_t, D^{(t)}]\,[m_t \mid D^{(t)}].$$

By assuming $m_t$ and $Y_t \mid m_t$ can be modelled by Dynamic Logistic Regression and
Dynamic Linear models respectively, Tam and Clarke (2015b) provided results for the
predictive distributions, $[Y_t \mid m_t, D^{(t)}]$ and $[m_t \mid D^{(t)}]$.
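The relationship between the log quantity, the crop type and the yield is an element-wise product. As a minimal sketch, assuming for illustration that the crop type is coded as a 0/1 indicator for the crop of interest and using made-up values for four observation units:

```python
import numpy as np

Y = np.log(np.array([2.0, 5.0, 1.5, 4.0]))   # Y_it = log Q_it (hypothetical quantities)
m = np.array([1, 0, 1, 1])                   # 1 if the unit grows the crop of interest

Q = np.exp(Y)          # Exp(Y_t): element-wise exponential
M = Q * m              # Hadamard (element-wise) product Exp(Y_t) * m_t
total = M.sum()        # population total 1'M_t
```

Units that do not grow the crop contribute zero to the total, whatever their harvestable quantity.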


To illustrate the idea for predicting quantity, under the assumptions of this Section,
(3) becomes

$$
\begin{pmatrix} Y_{ot} \\ Y_{rt} \end{pmatrix}
=
\begin{pmatrix} Z_{ot} \\ Z_{rt} \end{pmatrix} \beta_t
+
\begin{pmatrix} e_{ot} \\ e_{rt} \end{pmatrix}
\qquad (12)
$$

in which we have dropped the subscript ‘B’ to simplify notation. See Section 7.2
below for the choice of covariates and suggestions for improving the model in (12).
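Assuming that

$$
\begin{aligned}
Y_t \mid Z_t, \beta_t &\sim N(Z_t\beta_t, \Sigma_t) \\
\beta_t &= H\beta_{t-1} + \varepsilon_t, \quad \beta_{t-1} \perp \varepsilon_t \\
\beta_0 &\sim N(\beta_{0|0}, \Omega_{0|0}) \\
\varepsilon_t &\sim \text{independent } N(0, \Omega_t), \quad \varepsilon_t \perp D^{(t)} \qquad (13)
\end{aligned}
$$

and $\Omega_t$ and $\Sigma_t = \begin{pmatrix} \Sigma_{oot} & 0 \\ 0 & \Sigma_{rrt} \end{pmatrix}$ are known, the predictive distribution of the total yield
of a particular crop (Tam and Clarke, 2015b) is $1_o'Q_{ot} + 1_r'\mathrm{Exp}(\hat{Y}_{rt})$, where

$$
\begin{aligned}
\hat{Y}_{rt} &\sim N(Z_{rt}\beta_{t|t},\ \Sigma_{rrt} + Z_{rt}\Omega_{t|t}Z_{rt}') \\
\beta_{t|t} &= H\beta_{t-1|t-1} + \Omega_{t|t}Z_{ot}'\Sigma_{oot}^{-1}(Y_{ot} - Z_{ot}\beta_{t|t-1}), \quad \beta_{t|t-1} = H\beta_{t-1|t-1} \\
\Omega_{t|t} &= (\Omega_{t|t-1}^{-1} + Z_{ot}'\Sigma_{oot}^{-1}Z_{ot})^{-1} \\
\Omega_{t|t-1} &= H\Omega_{t-1|t-1}H' + \Omega_t. \qquad (14)
\end{aligned}
$$

Here $\mathrm{Exp}(\hat{Y}_{rt})$ denotes the vector with $\exp(\hat{Y}_{irt})$ as its i-th element, $\hat{Y}_{irt}$ is the i-th
element of $\hat{Y}_{rt}$, $\beta_{t|t}$ denotes the posterior mean of $\beta_t$ given $D^{(t)}$, and $\Omega_{t|t}$ is the
variance-covariance matrix of $\beta_{t|t}$.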
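The recursions in (14) are a Kalman filter update written in information form. The sketch below implements one filtering step and the prediction for non-sampled units, assuming known covariance matrices as in (13). The function names, the dimensions and all numerical values are hypothetical and are used only to show the mechanics.

```python
import numpy as np

def filter_step(beta_prev, Omega_prev, H, Omega_t, Z_o, Y_o, Sigma_oo):
    """One pass of the recursions in (14): returns beta_{t|t} and Omega_{t|t}."""
    beta_pred = H @ beta_prev                        # beta_{t|t-1}
    Omega_pred = H @ Omega_prev @ H.T + Omega_t      # Omega_{t|t-1}
    Sigma_inv = np.linalg.inv(Sigma_oo)
    Omega_filt = np.linalg.inv(np.linalg.inv(Omega_pred) + Z_o.T @ Sigma_inv @ Z_o)
    beta_filt = beta_pred + Omega_filt @ Z_o.T @ Sigma_inv @ (Y_o - Z_o @ beta_pred)
    return beta_filt, Omega_filt

def predict_unsampled(beta_filt, Omega_filt, Z_r, Sigma_rr):
    """Mean and covariance of the predictive distribution for non-sampled units."""
    return Z_r @ beta_filt, Sigma_rr + Z_r @ Omega_filt @ Z_r.T

rng = np.random.default_rng(3)
n, r, p = 50, 5, 2                                   # sampled units, non-sampled units, covariates
Z_o, Z_r = rng.normal(size=(n, p)), rng.normal(size=(r, p))
beta_true = np.array([1.0, 2.0])
Y_o = Z_o @ beta_true + 0.1 * rng.normal(size=n)     # simulated ground truth

beta_filt, Omega_filt = filter_step(
    beta_prev=np.zeros(p), Omega_prev=10.0 * np.eye(p), H=np.eye(p),
    Omega_t=0.01 * np.eye(p), Z_o=Z_o, Y_o=Y_o, Sigma_oo=0.01 * np.eye(n))
mean_r, cov_r = predict_unsampled(beta_filt, Omega_filt, Z_r, 0.01 * np.eye(r))
```

The update inverts a matrix of the size of the coefficient vector rather than of the sample, which is convenient when the ground truth sample is much larger than the number of covariates. It gives the same filtered mean as the more familiar gain form of the Kalman filter.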
Note the above methodology may be adapted to a ‘design-assisted’ approach (Särndal
et al., 1992) for estimating finite population parameters using the following heuristic
argument. From (3), the Generalised Regression Estimator for the total yield, $1'Y_t$, is

$$\hat{t}_{ot}(Y_t) + \{1'Z_t - \hat{t}_{ot}(Z_t)\}\,\hat{\beta}_{ot},$$

where $\hat{t}_{ot}(Y_t)$ and $\hat{t}_{ot}(Z_t)$ are the Horvitz-Thompson estimators of the totals of $Y_t$ and $Z_t$
respectively, and $\hat{\beta}_{ot}$ is the design based estimator of $\beta_t$ at time $t$ (Särndal et al.,
1992). Following Wright (1983), even though $\beta_{t|t}$ is not asymptotically design
unbiased, we may use it for $\hat{\beta}_{ot}$.
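The Generalised Regression Estimator above can be sketched in a few lines. This illustration assumes simple random sampling without replacement and uses an ordinary least squares fit as a stand-in for the coefficient estimate; in the approach described in the text, the filtered mean from (14) would take its place. The population, the sample size and the function name `greg_total` are hypothetical.

```python
import numpy as np

def greg_total(y_s, Z_s, pi_s, Z_total, beta_hat):
    """Generalised regression estimator of the population total of y.

    y_s, Z_s : response and covariates for the sampled, responding units
    pi_s     : their inclusion probabilities
    Z_total  : known population totals of the covariates (1'Z_t)
    """
    w = 1.0 / pi_s
    ht_y = np.sum(w * y_s)                 # Horvitz-Thompson estimate of the y total
    ht_Z = Z_s.T @ w                       # Horvitz-Thompson estimates of the Z totals
    return ht_y + (Z_total - ht_Z) @ beta_hat

rng = np.random.default_rng(11)
N, n = 10_000, 500
Z = np.column_stack([np.ones(N), rng.uniform(0.0, 1.0, size=N)])
y = Z @ np.array([1.0, 4.0]) + 0.2 * rng.normal(size=N)

s = rng.choice(N, size=n, replace=False)   # simple random sample without replacement
pi_s = np.full(n, n / N)
beta_hat, *_ = np.linalg.lstsq(Z[s], y[s], rcond=None)

estimate = greg_total(y[s], Z[s], pi_s, Z.sum(axis=0), beta_hat)
ht_only = np.sum(y[s] / pi_s)              # Horvitz-Thompson estimate without covariates
```

The regression term corrects the Horvitz-Thompson estimate by the amount the sample misrepresents the known covariate totals, so the estimator is exact whenever the response is an exact linear function of the covariates.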
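Likewise, denoting $\sigma(Z_{it}'\gamma_t) = [1 + \exp(-Z_{it}'\gamma_t)]^{-1}$ as the logistic sigmoid for
observation $i$ at time $t$, and assuming $m_{it} \sim$ independent Binomial Logistic
$(\sigma(Z_{it}'\gamma_t))$, or

$$
\begin{aligned}
m_{ot} \mid Z_{ot}, \gamma_t &\sim \text{independent Binomial Logistic}(\sigma(Z_{ot}\gamma_t)) \\
\gamma_t &= H\gamma_{t-1} + \varepsilon_t, \quad \gamma_{t-1} \perp \varepsilon_t \\
\gamma_0 &\sim N(\gamma_{0|0}, \Xi_{0|0}) \\
\varepsilon_t &\sim \text{independent } N(0, \Xi_t), \quad \varepsilon_t \perp D^{(t)} \qquad (15)
\end{aligned}
$$

where $m_{ot} = (m_{1t}, \ldots, m_{nt})'$ and $\sigma(Z_{ot}\gamma_t) = (\sigma(Z_{1t}'\gamma_t), \ldots, \sigma(Z_{nt}'\gamma_t))'$ etc. and $\Xi_t$ is
known, then (Tam and Clarke, 2015b)

$$m_{it} \mid D^{(t)} \sim \text{independent Binomial Logistic}(\sigma(Z_{irt}'\gamma_{t|t})) \qquad (16)$$

for the unobserved units $i$,
where $\gamma_{t|t} = H\gamma_{t-1|t-1} + \Xi_{t|t}Z_{ot}'\{m_{ot} - \sigma(Z_{ot}\gamma_{t|t-1})\}$, with $\gamma_{t|t-1} = H\gamma_{t-1|t-1}$,
and $\Xi_{t|t}$ is given by a companion recursion that is not legible in this copy (see Tam
and Clarke, 2015b).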


7.1 Statistical computing issues 


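The examples shown above make the unrealistic assumptions that quantities like $\Sigma_t$,
$\Omega_t$ and $\Xi_t$ are known. In reality they are not and have to be estimated from the
observed data. To make the estimation task more manageable, one can consider
modelling the unknown quantities as follows

$$
\begin{aligned}
\Sigma_t &= \lambda_t(\Sigma)\,\Sigma \\
\Omega_t &= \lambda_t(\Omega)\,\Omega \\
\Xi_t &= \lambda_t(\Xi)\,\Xi
\end{aligned}
$$

where the scalars $\lambda_t(\Sigma)$, $\lambda_t(\Omega)$, $\lambda_t(\Xi) > 0$ follow an uninformative prior,

$$\Sigma \sim W^{-1}(\Sigma_0, \nu_\Sigma), \quad \Omega \sim W^{-1}(\Omega_0, \nu_\Omega), \quad \Xi \sim W^{-1}(\Xi_0, \nu_\Xi)$$

and $W^{-1}$ denotes the Inverse-Wishart distribution.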
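To give a feel for the scaled covariance structure, the sketch below draws a baseline covariance matrix from an Inverse-Wishart prior and scales it by a time-specific scalar. It uses the standard construction that the inverse of a Wishart draw is Inverse-Wishart distributed, restricted to integer degrees of freedom. The scale matrix, the degrees of freedom and the scalar value are hypothetical choices for illustration.

```python
import numpy as np

def sample_inverse_wishart(scale, df, rng):
    """Draw from the Inverse-Wishart W^{-1}(scale, df), integer df >= dimension."""
    p = scale.shape[0]
    # If W ~ Wishart(scale^{-1}, df) then W^{-1} ~ Inverse-Wishart(scale, df).
    L = np.linalg.cholesky(np.linalg.inv(scale))
    X = rng.normal(size=(df, p)) @ L.T       # rows are N(0, scale^{-1}) draws
    return np.linalg.inv(X.T @ X)

rng = np.random.default_rng(5)
Sigma0, nu = np.eye(2), 10                   # hypothetical prior scale and degrees of freedom
draws = np.array([sample_inverse_wishart(Sigma0, nu, rng) for _ in range(5000)])

Sigma = draws[0]                             # one draw of the baseline covariance
lam_t = 1.5                                  # hypothetical value of the scalar for time t
Sigma_t = lam_t * Sigma                      # covariance used at time t
```

The average of the draws is close to the prior mean, which is the scale matrix divided by the degrees of freedom minus the dimension minus one. A single scalar per time point keeps the number of unknowns small while still letting the overall variability change over time.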
Let $\Theta_t = \{\theta, \phi, \psi, \lambda_t(\Sigma), \lambda_t(\Omega), \lambda_t(\Xi), \Sigma, \Omega, \Xi\}$, which may also include $H$ if it is not
known. Assuming (4) and (9) are fulfilled, then the posterior distribution of $\Theta_t$ is

$$[\Theta_t \mid D^{(t)}] \propto [D^{(t)} \mid \Theta_t]\,[\Theta_t],$$

i.e. likelihood times the prior. ‘Maximum a posteriori’ estimates of $\Theta_t$ can be derived
using the EM algorithm (see Haykin (2001, Chapter 5) and also Strickland et al.
(2009, 2011) for efficient estimation applied to satellite data).


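Alternatively, the predictive distribution, $[M_t \mid D^{(t)}]$, where $M_t = \mathrm{Exp}(Y_t) * m_t$ as
before, can be derived using Monte Carlo via the method of composition as follows.
From

$$
\begin{aligned}
[Y_t, m_t, \Theta_t \mid D^{(t)}] &= [Y_t, m_t \mid D^{(t)}, \Theta_t]\,[\Theta_t \mid D^{(t)}] \\
&= [Y_t \mid m_t, D^{(t)}, \Theta_t]\,[m_t \mid D^{(t)}, \Theta_t]\,[\Theta_t \mid D^{(t)}],
\end{aligned}
$$

one can use the LibBi software as outlined in Murray (2015) to draw $J$ samples
$\Theta_t^1, \ldots, \Theta_t^J$ from $[D^{(t)} \mid \Theta_t]\,[\Theta_t]$. Using these values and equations (14) and (16), we
obtain samples $Y_t^1, \ldots, Y_t^J$ from $N(Z_{rt}\beta_{t|t},\ \Sigma_{rrt} + Z_{rt}\Omega_{t|t}Z_{rt}')$ and $m_t^1, \ldots, m_t^J$ from
$[m_t \mid D^{(t)}, \Theta_t]$ respectively, where the i-th element of the vector $m_t$ follows a
Binomial Logistic Regression model with logistic sigmoid $\sigma(Z_{it}'\gamma_{t|t})$, from which a
sample of $M_t^1, \ldots, M_t^J$ can be obtained for Monte Carlo inference on $[M_t \mid D^{(t)}]$.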
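The method of composition described above can be sketched as a loop over parameter draws. The example is a simplification under stated assumptions: the parameter draws are generated from fixed normal distributions as stand-ins for posterior samples (in practice they would come from a sampler such as the one referred to in the text), the log quantities are drawn independently of the crop type given the parameters, and a common error standard deviation is used. All names and numbers are hypothetical.

```python
import numpy as np

def composition_draws(Z_r, beta_draws, gamma_draws, sigma, rng):
    """For each parameter draw, sample Y_r and m_r and form the total of M_r."""
    totals = np.empty(len(beta_draws))
    for j, (beta, gamma) in enumerate(zip(beta_draws, gamma_draws)):
        y = rng.normal(Z_r @ beta, sigma)                 # draw of the log quantities
        prob = 1.0 / (1.0 + np.exp(-(Z_r @ gamma)))       # logistic sigmoid
        m = rng.random(len(Z_r)) < prob                   # draw of the crop type indicators
        totals[j] = np.sum(np.exp(y) * m)                 # total of Exp(Y) * m
    return totals

rng = np.random.default_rng(2015)
J, r, p = 2000, 100, 2
Z_r = np.column_stack([np.ones(r), rng.normal(size=r)])

# Stand-ins for posterior draws of the regression and logistic coefficients.
beta_draws = rng.normal([0.5, 0.2], 0.05, size=(J, p))
gamma_draws = rng.normal([0.0, 1.0], 0.10, size=(J, p))

totals = composition_draws(Z_r, beta_draws, gamma_draws, sigma=0.3, rng=rng)
point = totals.mean()                                     # Monte Carlo point prediction
lower, upper = np.quantile(totals, [0.025, 0.975])        # 95% prediction interval
```

Because each total is built from a fresh parameter draw, the spread of the simulated totals reflects both parameter uncertainty and the randomness of the unobserved units.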


Strickland et al. (2013) have also developed a Python package, pyMCMC, for fast
multivariate state space modelling, which is scheduled for release in June 2015.


7.2 Choosing covariates and improving the model fit 


There is a huge literature on predicting crop yields; see, for example, Johnson (2014)
and the references therein. A review of the methodology is provided in Lobell (2013).
Based on the science of crops, most of these methods use the Normalised Difference
Vegetation Index (NDVI) or the Enhanced Vegetation Index (EVI), which are simple
functions of the near-infrared radiation and visible radiation, and include other variables
such as soil moisture and land surface temperature, available from satellites and other
sources, as covariates. A Stress Index, derived from thermal time and
crop phenology, both from remote sensing (Idso et al., 1981; Jackson et al., 1983;
Rodriguez et al., 2005) and directly modelled from a biophysical crop model
(Potgieter et al., 2005; Potgieter and Hammer, 2006), has also been proposed as a
covariate. In addition, evapotranspiration derived from EVI and the Global Vegetation
Moisture Index has been suggested as a covariate (Guerschman et al., 2009). These
covariates can be incorporated in an obvious way into the State Space Model described
above, although care has to be exercised to ensure there is no collinearity issue or
model over-fitting.

ABS • A STATISTICAL FRAMEWORK FOR ANALYSING BIG DATA • 1351.0.55.056
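As an example of how simple these satellite-derived covariates are, NDVI is the difference between near-infrared and red reflectance divided by their sum. The sketch below computes it for a few made-up reflectance values; the function name and the handling of pixels where both bands are zero are choices made here for illustration.

```python
import numpy as np

def ndvi(nir, red):
    """Normalised Difference Vegetation Index from NIR and red reflectance."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    # Return NaN where both bands are zero to avoid dividing by zero.
    return np.divide(nir - red, total, out=np.full_like(total, np.nan), where=total != 0)

values = ndvi([0.50, 0.30, 0.0], [0.10, 0.30, 0.0])
```

The index lies between -1 and 1, with larger values indicating denser green vegetation, and each value could enter the covariate matrix of the State Space Model as one column.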


Becker-Reshef et al. (2010) fitted a simple regression model using county yield
statistics as response variables and NDVI as explanatory variables, and used it to predict
yields. Newlands et al. (2014) extended this work by employing a multivariate
regression model using NDVI and agro-climate data as covariates. In addition, their
model allows a lag-1 autoregressive term for crop yields and allows the coefficients to
vary over time and space, although no stochastic relationships between these
coefficients were exploited. Priors on the parameters of the multivariate regression
model were constructed using residual bootstrapping (Bornn and Zidek, 2012). The
State Space Modelling advocated in this paper can be regarded as an extension of the
methodology developed by Newlands et al. (2014).


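Where the model defined in (13) does not adequately predict crop quantities, the
following model may be considered:

$$
\begin{aligned}
Y_t \mid Z_t, \beta_t, F, \alpha_t &\sim N(F\alpha_t + Z_t\beta_t, \Sigma_t) \\
\begin{pmatrix} \alpha_t \\ \beta_t \end{pmatrix} &= \begin{pmatrix} H_1\alpha_{t-1} \\ H_2\beta_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \\
\begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} &\sim \text{independent } N(0, \Omega_t), \quad \varepsilon_t \perp D^{(t)} \\
\Omega_t &= \begin{pmatrix} \Omega_{1t} & 0 \\ 0 & \Omega_{2t} \end{pmatrix}
\end{aligned}
$$

In other words, the time-variant fixed effects, $F\alpha_t$, are used to ‘sweep’ up any missing
covariates in the modelling. This approach is akin to using random slopes in multi-level
modelling (Snijders and Bosker, 1999, Chapter 5) and is also known as a
Generalised Linear Mixed Model. A similar approach may be adopted for the model
described in equation (14). The suggested approach, however, would require large
sample sizes, as well as longer time series, for accurate and precise estimation.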
7.3 Concluding remarks


In developing the above models and building the training data set for analyses, I found
that I had to involve crop scientists (or more generally “domain experts”; page 26 of
the AAPOR Report), statisticians and computer scientists, supporting the comment
that a multi-disciplinary team is required to harness the opportunities, and address the
challenges, from Big Data. New skill sets are required to integrate ground truth data
with satellite data.


Recommendation 1 of the AAPOR Report (page 2) says: 


“Survey and Big Data are complementary data sources and not competing data sources.
There are differences between the approaches, but this should be seen as an advantage
rather than a disadvantage”.


This paper outlines an approach to combine the strength of Big Data with survey data
(which has been regarded as the ‘gold standard’ for collecting data to make valid
statistical inference) for predicting crop yields. The basic ideas are to use the Big
Data and other auxiliary sources to calibrate the response variables, and to apply State
Space Modelling to solve finite population inference problems. However, this is
possible because the population covered by satellite imagery is identical to the
population of land parcels, and the missing covariates problem is by-passed by relying
on the traditional survey methods of estimation in those areas without satellite data,
e.g. missing data due to clouds. The efficacy of the approach will be tested using the
training data set that is being built in the ABS. I hope to be able to report the
outcome of the analyses, successful or otherwise, in the future elsewhere.


Once again, I congratulate the AAPOR Task Force for providing an excellent Report. 